Introduction to Triton Programming: From Eager Execution to Block-Based Parallelism

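Moving from PyTorch eager mode to Triton requires a change in how you think about tensors: instead of treating a tensor as one monolithic object, you treat it as a collection of manageable blocks, also called tiles.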

1. PyTorch Tensors vs. Triton Tensors

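It is important to distinguish Triton tensors from PyTorch tensors. A PyTorch tensor is a host-side Python object that wraps shape, data type, device, strides, and storage metadata. Triton, by contrast, works with raw data pointers into specific blocks of memory, which permits lower-level optimization.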
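The difference between "metadata object" and "raw pointer" can be sketched in pure Python. This is only an illustration under stated assumptions: `TensorMeta`, `contiguous_strides`, and `flat_offset` are hypothetical names invented here, not PyTorch or Triton APIs, and a Python list stands in for device memory. The point is that the metadata says how to interpret a flat buffer, while the kernel-side view is a base address plus an element offset.

```python
from dataclasses import dataclass


@dataclass
class TensorMeta:
    """Hypothetical stand-in for the metadata a host-side tensor carries."""
    shape: tuple
    strides: tuple  # measured in elements, not bytes


def contiguous_strides(shape):
    """Row-major strides (in elements) for the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def flat_offset(meta, index):
    """Element offset from the base pointer for a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, meta.strides))


# A flat buffer standing in for device memory (assumption for illustration).
storage = list(range(12))
meta = TensorMeta(shape=(3, 4), strides=contiguous_strides((3, 4)))

# Row 1, column 2 of the 3x4 view resolves to a single flat offset.
offset = flat_offset(meta, (1, 2))
value = storage[offset]
```

Inside a kernel, only the base pointer and values such as strides passed as explicit arguments are available; the rest of the host-side object is not.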

2. The Eager-Mode Bottleneck

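In standard eager execution, every operation (for example, an addition followed by a ReLU) launches its own kernel and makes a full round trip through global memory. For memory-bound work such as chains of elementwise operations, that traffic is often the main bottleneck on modern GPUs. Triton addresses it by fusing multiple operations into a single kernel that processes blocks of data (for example, 128, 256, or 512 elements) directly in on-chip memory.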
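A rough way to see why fusion helps is to count element transfers to and from global memory for an add followed by a ReLU. The sketch below is a simplified model, not a measurement: it ignores caches and kernel-launch overhead, and the function names are hypothetical. `fused_add_relu_block` emulates the per-block work in plain Python, where the intermediate sum stays in a local variable instead of being written back.

```python
def eager_traffic(n):
    """Element transfers when add and ReLU run as two separate kernels."""
    add_kernel = 2 * n + n   # read x and y, write the temporary
    relu_kernel = n + n      # read the temporary, write the output
    return add_kernel + relu_kernel


def fused_traffic(n):
    """Element transfers when add and ReLU run inside one kernel."""
    return 2 * n + n         # read x and y, write the output


def fused_add_relu_block(x_block, y_block):
    """Emulates one block of fused work; the sum is never stored separately."""
    return [max(a + b, 0.0) for a, b in zip(x_block, y_block)]


n = 1_000_000
saved = eager_traffic(n) - fused_traffic(n)  # transfers avoided by fusing
```

Under this model the eager version moves 5n elements and the fused version 3n, and the gap widens as more elementwise operations are chained.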

3. The Block-Oriented Paradigm

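Unlike the scalar, per-thread thinking of traditional CUDA, Triton applies SPMD (Single Program, Multiple Data) at the block level. You write a single kernel, and Triton launches many instances of it across a grid. Each instance uses its program_id to work out which block of the data it owns.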
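The ownership rule can be emulated in plain Python. The helpers below are hypothetical and run on the CPU; in an actual Triton kernel the corresponding pieces would be `tl.program_id`, `tl.arange`, and a mask passed to `tl.load` and `tl.store`. The mask matters because the last block usually extends past the end of the data.

```python
def cdiv(n, block_size):
    """Ceiling division: number of program instances needed to cover n elements."""
    return (n + block_size - 1) // block_size


def program_block(pid, n, block_size):
    """Offsets owned by one program instance, plus a mask for in-bounds elements."""
    offsets = [pid * block_size + i for i in range(block_size)]
    mask = [off < n for off in offsets]
    return offsets, mask


def block_ranges(n, block_size):
    """Half-open [start, end) index range covered by each pid."""
    ranges = []
    for pid in range(cdiv(n, block_size)):
        start = pid * block_size
        ranges.append((start, min(start + block_size, n)))
    return ranges


# Example: 1000 elements split into blocks of 256.
ranges = block_ranges(1000, 256)
last_offsets, last_mask = program_block(3, 1000, 256)
```

With N=1000 and BLOCK_SIZE=256, four instances are launched, and the last one has only 232 in-bounds elements; the remaining 24 lanes are masked off.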

4. Environment Setup

Before you start, install Triton in a clean environment (Conda or venv) so that it does not run into dependency conflicts with an existing CUDA toolkit: pip install triton.
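After installing, a quick check from Python can confirm what the environment actually sees. This is a minimal sketch using only the standard library; `package_status` is a hypothetical helper, and it assumes the import name matches the distribution name, which may not hold for every build (in that case it reports "unknown" rather than a version).

```python
import importlib.util
from importlib import metadata


def package_status(name):
    """Return the installed version string, or None if the package is not importable."""
    if importlib.util.find_spec(name) is None:
        return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution is registered under this exact name.
        return "unknown"


for pkg in ("torch", "triton"):
    print(pkg, package_status(pkg) or "not installed")
```

Running this inside the new environment, rather than the system interpreter, shows whether the packages resolve from the place you expect.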


QUESTION 1

What is the primary difference between a PyTorch tensor and a Triton tensor within a kernel?

Triton tensors contain Python metadata like strides; PyTorch tensors are raw pointers.

A PyTorch tensor is a host-side object wrapping metadata; a Triton tensor represents blocks of data processed at the compiler level.

There is no difference; they are the same object.

Triton tensors are stored on the CPU, while PyTorch tensors are on the GPU.

QUESTION 2

Why is 'Eager Mode' considered a bottleneck for modern GPU performance?

Because it uses too much CPU memory.

Every operation requires a separate kernel launch and a global memory round-trip.

It cannot handle floating-point numbers.

It lacks support for the Python language.

QUESTION 3

What can happen if you install Triton in a 'dirty' environment with conflicting CUDA toolkits?

Triton will automatically fix the CUDA path.

It may lead to library version mismatches and kernel compilation errors.

The GPU will run faster due to multiple toolkit options.

Triton does not use CUDA, so there is no conflict.

QUESTION 4

Which mapping from pid to index range is correct for N=1000, BLOCK_SIZE=256?

pid 0: [0, 256); pid 1: [256, 512); pid 2: [512, 768); pid 3: [768, 1000)

pid 0: [0, 1000)

pid 0: [0, 256); pid 1: [257, 512); pid 2: [513, 768); pid 3: [769, 1000)

pid 1: [0, 256); pid 2: [256, 512); pid 3: [512, 768); pid 4: [768, 1000)

QUESTION 5

In block-based parallelism, the unit of work shifts from 'compute one element' to:

'Compute one entire tensor'.

'Compute one block of 128/256/512 elements'.

'Compute one scalar at a time'.

'Let the CPU handle the math'.